Golang Job: Site Reliability Engineer Lead

Location

Plano, Texas - United States of America

Job type

Full-Time

Golang Job Details

We are looking for a Contract Senior Site Reliability Engineer to join our dynamic, fast-paced team. The ideal candidate will have extensive experience managing large-scale, microservice-based systems, ensuring high availability, and implementing best practices in reliability engineering. You will work closely with development and operations teams to enhance our infrastructure and improve system performance while remaining mindful of cost-effectiveness.

Responsibilities:

  • Proactively identify performance improvements in areas such as responsiveness, availability, and scalability.
  • Establish best practices around topics like observability, monitoring, and incident response, and drive adoption across the organization.
  • Lead incident response efforts and conduct post-mortem analyses to prevent future occurrences.
  • Coordinate with Software Engineering and DevOps teams to design, implement, and maintain scalable and reliable systems using Kubernetes, Docker, and Istio.
  • Monitor system performance and troubleshoot issues proactively, utilizing Datadog for observability.
  • Implement and tune Horizontal Pod Autoscalers (HPAs) to optimize resource utilization.
  • Develop and maintain automation tools for deployment, monitoring, and incident response.
  • Collaborate with software engineering teams to improve system reliability and performance.
  • Implement A/B deployments, canary deployments, and traffic mirroring strategies to ensure critical updates go smoothly and can be rolled back with minimal impact if necessary.
  • Mentor junior engineers and contribute to team knowledge sharing.
  • Oversee and coordinate with SREs in other parts of the world, ensuring effective collaboration during on-call rotations.
  • Establish and enforce best practices for system reliability and performance across the organization.
  • Utilize Helm charts for application deployment and management.
  • Understand and implement AWS systems, including AWS Load Balancers and routing, to support systems handling millions of requests per hour.
  • Participate in on-call rotations and provide support for production systems.

Required Qualifications:

  • 5+ years of production experience working as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
  • Demonstrated ability to deliver highly available solutions at scale.
  • Demonstrates advanced problem-solving, troubleshooting, and decision-making skills
  • Expertise in containerization technologies (Docker, Kubernetes, and Istio) to build, package, and deploy optimized container images
  • Expertise in AWS
  • Experience with Argo CD for continuous delivery and GitOps practices.
  • Proficiency in monitoring and alerting tools such as Datadog, AppDynamics, ELK, Grafana, or Prometheus.
  • Familiarity with A/B, Canary, Blue/Green deployments, and traffic mirroring techniques.
  • Experience with infrastructure-as-code and configuration management tools such as Terraform, Ansible, or equivalent.
  • Demonstrated ability to balance cost considerations with performance and reliability.
  • Experience delegating tasks to junior engineers
  • Experience in leading initiatives under direction
  • Ability to apply systems thinking to understand interdependencies and design solutions that achieve results
  • Ability to learn and apply new technologies, programming practices, patterns, and methods
  • Experience mentoring, providing technical guidance, and training more junior team members
  • Ability to work independently and take ownership of tasks/assignments
  • Organized and detail-oriented
  • Ability to develop healthy working relationships and collaborate with peers and leaders
  • Exhibits integrity and high standards in work quality
  • Excellent verbal and written communication skills
  • Proficiency in Golang or Rust is a plus but not required.
  • Values diversity and differences amongst individuals in interactions